Abstract

Reliable and fast builds are essential for rapid turnaround 
during development and testing. Popular existing build 
systems rely on correct manual specification of build 
dependencies, which can lead to invalid build outputs 
and nondeterminism. We outline the challenges of de- 
veloping reliable build systems and explore the design 
space for their implementation, with a focus on non- 
distributed, incremental, parallel build systems. We de- 
fine a general model for resources accessed by build tasks 
and show its correspondence to the implementation tech- 
nique of minimum information libraries, APIs that re- 
turn no information that the application doesn't plan to 
use. We also summarize preliminary experimental re- 
sults from several prototype build managers. 

1 Introduction 

Large software projects often reach thousands of files 
and millions of lines of source code. Build automation 
systems, or build systems for short, are responsible for 
automating the execution of build tools such as compil- 
ers in order to process all the source code and produce 
the final, executable output. The time required to execute 
a build is a critical factor in a number of software engi- 
neering metrics such as: developer cycle time, frequency 
of continuous integration testing, throughput of check-in 
verification systems, and time to ship a critical patch; yet 
a 2003 survey showed that more than half of the 30 sur- 
veyed commercial projects had a clean, sequential build 
time of 5-10 hours. [11] This motivates the development
of builds that can run faster than a clean build. 

To address this need, existing build systems pro- 
vide two features: parallel builds, in which multiple 
build tasks are executed simultaneously, and incremental
builds, in which results of previous builds are reused and 
only a subset of build tasks are run, based on what build 
inputs have changed. In both types of builds, the devel- 
oper must explicitly specify dependencies for each build 
task, describing other build tasks which must run before 
it. For example, in a C project, C source files must be 
compiled into object files before the object files can be 
linked into an executable binary. If even one dependency 
is omitted, the soundness of both parallel and incremen- 
tal builds is compromised: build tasks may be run out of 
order, leading to incorrect re-use of out-of-date results, 
build failure due to missing results, and race conditions 
due to concurrent access to files. Whether a failure oc- 
curs, and which failure occurs, depends on which input 
files have changed and the build schedule selected by the 
build system. As a consequence, "[m]ost organizations 
run their builds completely sequentially or with only a 
small speedup, in order to keep the process as reliable 
as possible." [11] If developers and organizations viewed
their parallel, incremental builds as highly reliable, they 
could use them consistently throughout the development, 
testing, and release process, accelerating these processes 
and offloading the mental burden of build management. 
Incomplete dependencies arise naturally whenever a 
developer change introduces a new dependency, but fails 
to correctly update the dependency information. As a 
simple example, consider the build described by this 
makefile: 

all: generated.h foo

generated.h: config
	./gen config -o generated.h

foo: foo.c
	gcc foo.c -o foo

Here, a tool called gen is run to generate the header file
generated.h from a file config; then the binary foo is compiled
from the C source file foo.c. Now suppose the developer
modified foo.c to include the header file generated.h, and
also modified config. A serial build will still produce the
expected result, since generated.h is listed before foo in the
"all" target; but an incremental or parallel build may run
the gcc action before, or simultaneously with, the gen action,
leading to incorrect output or build failure.

This work explores background and existing work in 
build systems and obstacles and design options for reli- 
able build systems. It also presents a formal model for 
build system analysis and discusses some early experi- 
mental results with several prototypes. 



2 Background 

Dependencies in a build are described by a dependency 
graph, a directed acyclic graph (DAG) where build tasks 
(typically, invocations of a build tool) are vertices, and an 
edge from A to B indicates that B depends on A. Given 
such a graph and a uniform set of processors, deciding 
which tasks to run at what time is an instance of the DAG 
scheduling problem, which is studied in the context of 
static scheduling of processes in high-performance com- 
puting. It is NP-complete even in the restricted case 
where there are two processors, no dependencies, and 
the run time of every task is known (closely related to 
the partition problem), but a number of effective heuris- 
tics are available in practice. 

A single-node build can be scheduled using a topo- 
logical sort, which can be computed by a simple online 
algorithm: at each step, select an arbitrary vertex with no 
incoming edges to run, and when it completes, delete it. 
A similar algorithm can schedule parallel builds: when- 
ever at least one processor is free, run an arbitrary task 
with no incoming edges, and whenever a build step fin- 
ishes delete its vertex. It is possible that all tasks have in- 
coming edges, in which case processors may remain idle 
until more tasks complete. This algorithm, used by make, 
is a version of Graham's classical online list scheduling 
algorithm, [4] and has the advantage of not requiring task
runtimes, but does not take into account the critical path 
(the path of largest total time). 
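The greedy procedure above can be sketched as a small simulation. This is a minimal illustration, not make's actual implementation; the function name `list_schedule` and the dictionary-based graph encoding are assumptions made for the example. Task run times are used only to advance the simulated clock, not to choose which task to start.

```python
import heapq
from collections import defaultdict


def list_schedule(durations, deps, workers):
    """Simulate greedy list scheduling on `workers` identical processors.

    durations: {task: run time}
    deps:      {task: set of tasks that must finish first}
    Returns (makespan, {task: start time}).
    """
    remaining = {t: len(deps.get(t, ())) for t in durations}
    dependents = defaultdict(list)
    for task, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (finish time, task)
    start = {}
    now = 0.0
    while ready or running:
        # Whenever a processor is free, start an arbitrary ready task.
        while ready and len(running) < workers:
            task = ready.pop(0)
            start[task] = now
            heapq.heappush(running, (now + durations[task], task))
        # Advance to the next completion and "delete" that vertex.
        now, done = heapq.heappop(running)
        for t in dependents[done]:
            remaining[t] -= 1
            if remaining[t] == 0:
                ready.append(t)
    if len(start) != len(durations):
        raise ValueError("dependency graph contains a cycle")
    return now, start
```

With two tasks of run time 2 and 3, where the second depends on the first, two processors give a makespan of 5; without the dependency edge the makespan drops to 3, which is exactly the situation in which a missing edge lets tasks run simultaneously.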

The technique can be improved by assigning priorities
to nodes, using any of a number of heuristics, and
then selecting the node with the highest priority at each
step. [13] Effective priority assignment requires task runtime
estimates, which can be inferred from previous
builds and/or a runtime model. This approach has not
yet been tried.
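As one illustration of a priority heuristic, the sketch below ranks each task by the longest total run time of any chain that starts at that task. This is a common choice for critical-path scheduling, but the text does not commit to a particular heuristic, so this choice, the function name `critical_path_priorities`, and the input encoding are assumptions.

```python
from functools import lru_cache


def critical_path_priorities(durations, deps):
    """Return {task: length of the longest path from task to any sink}.

    durations: {task: estimated run time}
    deps:      {task: set of tasks that must finish first}
    Uses recursion, so very deep graphs would need an iterative version.
    """
    dependents = {t: [] for t in durations}
    for task, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(task)

    @lru_cache(maxsize=None)
    def longest_from(task):
        # Own run time plus the longest chain among direct dependents.
        tail = max((longest_from(d) for d in dependents[task]), default=0)
        return durations[task] + tail

    return {t: longest_from(t) for t in durations}
```

A scheduler would then pick, among the tasks with no incoming edges, the one with the largest value, so that long chains start as early as possible.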



2.1 Shared state and resources 

To model builds we define the shared state space S, typ- 
ically representing the filesystem and other state visible 
to multiple tasks as well as the task input (e.g. com- 
mand line, environment). A resource is a function r with 
domain S. Intuitively, a resource is anything that may 
be returned by a library function. Resources can range 
from simple predicates ("does this file exist?") to val- 
ues ("what are the contents of the file at this path?") to 
complex operations ("what is the abstract syntax tree ob- 
tained after parsing the source file at this path?"). A re- 
source can also encompass many files (such as the con- 
tents of all files in a subdirectory). Prior to starting 
the build, a fixed (typically infinite) resource space is 
selected — no build process may access resources outside 
that set. 

A build task performs a sequence of accesses (reads 
or writes) to resources. During a parallel build, accesses 
by many tasks may be interleaved to form an access sequence,
subject to the constraint that if g depends on f,
all accesses by f precede all accesses by g. Reads make
the current value of a resource accessible to the task ex- 
ecuting it, while writes update the shared state in such 
a way that one or more resources are set to a new given 
value. Any resources not written to during a write must 
remain unmodified. 

Build tasks must be deterministic, in the sense that 
their accesses (including type, resource, and value writ- 
ten) depend only on the results of prior reads. Two tasks 
are said to conflict (during a particular build) if one of 
them writes a resource that the other reads or writes. A 
given build is valid if, for any pair of conflicting tasks, 
there is a directed path from one to the other in the dependency
graph. It can be proven that if a given build is
valid, it produces the same final result as any other parallel
schedule, given the same initial shared state (see appendix A).
This allows us to meaningfully define a valid
configuration as a pair (dependency graph, start state)
that produces valid builds.
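The validity condition can be checked mechanically once the accesses of each task have been recorded. The sketch below is one minimal way to do so; the representation of accesses as read and write sets of resource names, and the helper names, are assumptions made for illustration.

```python
from itertools import combinations


def conflicts(a, b):
    """Two tasks conflict if one writes a resource the other reads or writes."""
    return bool(
        a["writes"] & (b["reads"] | b["writes"])
        or b["writes"] & (a["reads"] | a["writes"])
    )


def reachable(deps, src, dst):
    """True if dst depends on src, directly or transitively."""
    stack, seen = [dst], set()
    while stack:
        task = stack.pop()
        if task == src:
            return True
        if task not in seen:
            seen.add(task)
            stack.extend(deps.get(task, ()))
    return False


def invalid_pairs(accesses, deps):
    """Return the conflicting task pairs not ordered by the dependency graph."""
    return [
        (a, b)
        for a, b in combinations(sorted(accesses), 2)
        if conflicts(accesses[a], accesses[b])
        and not (reachable(deps, a, b) or reachable(deps, b, a))
    ]
```

Applied to the makefile example from the introduction, the gen and gcc tasks conflict on generated.h, so the build is reported as invalid until an edge from gen to the compile task is added.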

To model an incremental build, suppose we start with
initial state s_i, perform a build resulting in state s_f, modify
the shared state to get s_i', and then perform another
build. For now, we assume that for every task f, f has
no effect when acting on s_f; that is, right after the first
build is complete, re-running any one step will change
nothing (in practice, this typically means retaining and
not reusing intermediate files). Define the special task d
updating state s_f to s_i', representing the actions of the
developer, and add edges from d to all tasks that d conflicts
with. Now we assign d the lowest priority and create
a DAG schedule. This will move all nodes that don't
depend on d, directly or transitively, before d, where they
will have no effect, since they are acting on s_f. Effectively,
this means the only part of the graph that needs to be
scheduled is the transitive closure of d (d and every task
reachable from it).
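A minimal sketch of this selection step follows. It assumes the accesses of each task were recorded during the previous build and that the developer task d is summarized by the set of resources it changed; the function name `tasks_to_rerun` is hypothetical.

```python
def tasks_to_rerun(accesses, deps, changed):
    """Return the tasks reachable from the developer task d.

    accesses: {task: {"reads": set, "writes": set}} from the previous build
    deps:     {task: set of tasks that must finish first}
    changed:  set of resources written by d
    """
    # Tasks that conflict with d: they read or write something d wrote.
    dirty = {
        task
        for task, acc in accesses.items()
        if changed & (acc["reads"] | acc["writes"])
    }
    # Follow dependency edges until no more tasks are added.
    grew = True
    while grew:
        grew = False
        for task, prereqs in deps.items():
            if task not in dirty and dirty & set(prereqs):
                dirty.add(task)
                grew = True
    return dirty
```

Every task outside the returned set would be scheduled before d and therefore has no effect, so it can be skipped.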



2.2 Selecting a resource space 

There is a tradeoff in the choice of the resource space: if 
resources encompass too much state, there will be spuri- 
ous conflicts. For example, a trivial resource space has a 
single resource returning the entire shared state. In this 
space, all reads conflict with all writes, and the build
must run sequentially. 

On the other hand if resources are too fine-grained, the 
result will be that processes read and write a very large 
number of resources, resulting in excessive overhead for 
build management and a large dependency graph. For 
example, if every byte of every file had its own resource, 
a typical build task would access many thousands of re- 
sources. 

One straightforward strategy is to create a single re- 
source for the contents of each file on the disk. To ac- 
count for the creation and deletion of files, there is a re- 
source for every possible filepath, with a special value 
indicating the file does not exist or is inaccessible, analo- 
gous to a "read file contents" library function that returns 
NULL on failure. This simple resource system is similar 
to that used by make and is sufficient for many builds. 

Many applications require a notion of collection (or set)
resources, such as a directory. A naive representation
would have a resource for the contents of each collec- 
tion; but then two tasks creating files in the same direc- 
tory would conflict. Such a collection is best represented 
as an infinite set of resources, one for each potential ele- 
ment of the collection, indicating whether or not that el- 
ement is present (in the case of a directory, one for each 
filename, indicating whether that file exists in that direc- 
tory). A process that reads the collection (e.g. listing the 
files in the directory) reads all of these resources (note 
that this requires a concise representation for certain infi- 
nite resource sets). A process that adds or removes items 
from the collection may only affect a few of them. 
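One possible concise representation of such infinite resource sets is sketched below. The class name `DirEntries` and the convention that a missing name stands for every possible filename are assumptions made for illustration, not part of the model itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DirEntries:
    """Membership resources of one directory.

    name=None stands for the infinite set "every possible filename",
    as touched by a task that lists the directory.
    """
    directory: str
    name: Optional[str] = None

    def overlaps(self, other: "DirEntries") -> bool:
        if self.directory != other.directory:
            return False
        return self.name is None or other.name is None or self.name == other.name
```

Under this encoding, two tasks that create different files in the same directory touch disjoint resources, while a task that lists the directory overlaps with both of them.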

Although files are by far the most common resource, 
there are many examples of other resources that are use- 
ful. For example, the Linux kernel build has a single 
header containing all configuration options which is in- 
cluded by all source files. In order to make incremen- 
tal builds useful in the event of configuration option 
changes, the Linux build tracks each option as a separate 
resource. 

A set of resources in a resource space may be con- 
tracted to form a merged resource which yields a tuple 
of all the resources used to form it. Such contracted re- 
sources allow a gradual tradeoff between the number of 
resources accessed and the number of conflicts that occur 
during the build; see section 8 for more details.



2.3 Hidden resources 

There are resources that are used in practice by many 
tools but are not tracked by existing build managers, ei- 
ther by convention or because supporting them is diffi- 
cult. These include: 

• Compiler flags and tool configuration: if a build is 
done, and then tool configuration is altered, for ex- 
ample to enable debugging flags, all files must be 
rebuilt. If it is changed back, there is no need to 
rebuild everything again. Visual Studio implements 
solution configurations with separate output direc- 
tories to cope with this, but these are rarely used 
for more than two configurations. Vesta [6] records
outputs of many previous builds in its derived file 
cache. 

• Nonexistent files: Compiling a C source file containing
"#include <stdio.h>" searches the system
include path in order to find the header. Developers
often add project directories to this path. If a
file named "stdio.h" were ever created along this
path, it would change the result of the task, but
most extant tools would not detect the need to rebuild.
Vesta [6] and scons [8] track dependencies
on nonexistent files.

• Build configuration file: determining which part of 
a build needs to be rebuilt after changing the build 
configuration file itself (e.g. Makefile) is a difficult 
problem. Even small changes may affect all tasks 
or only a few, and determining which may require 
analyzing structural changes since the previous ver- 
sion. 

• Build tools, libraries, and system headers: upgrad- 
ing build tools or libraries used by build tools, or 
copying a source tree to a machine with different 
tools, can dramatically alter build output, but these 
are usually untracked. Sometimes this results in an 
incompatible combination of files generated by dif- 
ferent versions of tools. This motivates the com- 
mon industry practice of including all build tools in 
the version control repository. As mentioned in section 8,
it often makes sense to treat these files as a
single aggregate resource.

• Non-file resources: accesses to network resources, 
peripheral devices, the time, and so on are usu- 
ally untracked. Some real-world builds retrieve files 
during the build from remote sources, query remote 
databases, or even do web service queries. These 
should be tracked as resources, even if coarsely. 



• Special files: some files like those under "/proc" and 
"/dev" in UNIX may fail to update their last modi- 
fied time, or even change each time they are read. 

• Operating system: the results of system calls made 
to the kernel by build tools may affect build out- 
put. These results may vary depending on the spe- 
cific operating system, operating system version 
and patches, filesystem and drivers, or even kernel 
configuration options. These are untracked by all 
extant systems, and largely benign given a carefully 
designed resource space and a standards-compliant 
operating system. 

The choice of how to handle hidden resources depends 
on the resource space, the application, and build platform 
variability. Some applications may not use certain types 
of resources or may be built only on a fixed build server. 
In some cases, like the build configuration file, merely
detecting any change and triggering a full rebuild may 
be sufficient in practice. In other cases, where changes 
are frequent, fine-grained resource tracking is needed. 
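To make the nonexistent-file case from the list above concrete, the sketch below searches an include path and records every path it probes, including the misses. The function name `find_header` and the `probed` list are hypothetical; real compilers and build managers implement this differently.

```python
import os


def find_header(name, include_path, probed):
    """Return the first match for `name` along `include_path`, or None.

    Every candidate path is appended to `probed`, whether or not it exists:
    a miss is a dependency on that file continuing not to exist.
    """
    for directory in include_path:
        candidate = os.path.join(directory, name)
        probed.append(candidate)
        if os.path.isfile(candidate):
            return candidate
    return None
```

If a file later appears at one of the probed paths that was a miss, the task's result may change, so a build manager tracking these paths would know to re-run it.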

3 Related work 
3.1 Build systems 

A small number of build systems dominate in practice 
today, most of them based on make, created by Stuart 
Feldman in 1977 at Bell Labs. [9] With make, the developer
uses a domain-specific language to specify a series
of targets, and each target may declare explicit depen- 
dencies on other targets and/or source files. Each target 
has an associated shell command that builds the target. 
This explicit representation of the dependency graph fa- 
cilitates both incremental and parallel builds. However, 
dependencies must be specified correctly; if they are not, 
incremental builds may fail to rebuild portions of the ap- 
plication, leading to incorrect results with unpredictable 
behavior, and parallel builds may produce different out- 
puts nondeterministically. Make is designed for use on a 
single machine, and build results are not shared between 
developers. A number of important dependencies are ei- 
ther difficult to represent or omitted by convention, such 
as the ones mentioned in section 2.3; changes in these
may require a complete rebuild. Even incremental builds 
in make take time proportional to the size of the build as 
a whole due to the need to process all targets and scan 
all input files for changes. This process can be acceler- 
ated by using file timestamps to detect changes, at the 
expense of correctness, since this is not reliable in gen- 
eral. Although some build systems like Apache Ant and 
MSBuild adopt XML build description files in place of 
make's domain-specific language, facilitating greater ex- 
tensibility, they still inherit all of these issues. 



One of the most developed research build systems is 
Compaq/Digital Systems Research Center's Vesta, de- 
veloped in the late 1990s and released under the GNU 
LGPL in 2001. [6] Although Vesta does not support
parallel builds, it provides incremental builds reliable 
enough to be used in practice for product releases ("ev- 
ery build is incremental"). It tracks dependencies that 
extant tools like make incorrectly ignore, such as depen- 
dencies on build description files, compiler flags, nonex- 
istent files, and build tools. Through the derived file 
cache, compilation outputs are easily reused between developers.
Change detection and inference of dependencies
are implemented using a custom filesystem, so that
the filesystem does not need to be scanned to find modi- 
fied source files, and a sophisticated functional build de- 
scription language allows large portions of the build to be 
reused. [7] Using its derived file cache, Vesta can reuse
results not only from the previous build but from all pre- 
vious builds, by treating tool executions as functions and 
memoizing their results (see their runtool cache). 

Vesta was deployed by large product teams at Com- 
paq and Intel, but has not achieved widespread use. This 
can be attributed to several factors. One is that Vesta is 
a "package deal," requiring teams who use it to also use 
Vesta's custom filesystem and version control, both of 
which are not as mature, featureful, or well-supported 
as existing systems. Migration of existing projects to 
Vesta while preserving change histories is difficult or im- 
possible, and requires translating existing build descrip- 
tion files into Vesta's very different language. Modern 
builds are done in parallel, even on single nodes, and 
large builds are done on clusters, neither of which Vesta 
supports. Finally, the cost of incorrect incremental builds 
is hidden: it is difficult to measure the time spent by developers
resolving incorrect builds, or the time that might
have been saved by building product releases incremen- 
tally. 

A central feature of Vesta was repeatability, in which 
all source files used in a particular build can always be 
retrieved at a later time, and used to repeat the same 
build. Although this feature is valuable (e.g. for iso- 
lating source changes leading to behavior changes), it is 
separable from the other features and depends critically 
on integration with version control, so it is disregarded 
in this report. 

A very different approach to build systems was taken 
by Electric Cloud, [11] which disregarded incremental
builds in favor of using clusters of machines with paral- 
lel processors to speed up full builds as much as pos- 
sible, currently deployed as an enterprise commercial 
product. A network filesystem infers dependencies, and 
visualization tools facilitate the identification of bottlenecks.
Although fast and well-supported, Electric Cloud
is not suitable for routine developer builds, does not scale
down effectively to small projects, and is too expensive
for many applications such as open-source development. 
More recently, in 2012, Electric Cloud released
ElectricAccelerator Developer Edition, [3] which is designed
to run on a single machine, infers dependencies,
and implements accurate incremental and parallel builds, 
scaling up to four cores. Although this product effec- 
tively accomplishes the primary goals set out in this re- 
port, it chooses a single design and leaves room for im- 
provement in numerous directions, such as tool cooper- 
ation, sharing of derived files, custom resources, and so 
on. 

3.2 Build augmentation 

A number of more practical efforts have sought to aug- 
ment existing build tools by providing services to accel- 
erate them or improve their reliability. 

The GNU Make manual illustrates how to use the 
"-M" flag of gcc (the GNU C Compiler) to generate 
make dependencies for C/C++ builds on the fly and keep
them up-to-date automatically. These dependencies are 
incomplete, including only header and source files, but 
greatly increase reliability and reduce maintenance effort 
compared to manual specification for this specific type of 
build. 

The ccache tool, [15] based on compilercache, [14]
caches results of invocations of standard compiler tools 
like gcc, even if the intermediate files are later deleted 
or overwritten. It can dramatically improve incremental 
build times for C/C++ projects, but does not generalize
to other tools. scons [8] provides similar functionality.

Google relies on conventional distributed builds with 
coarse-grained tasks and manually-specified dependen- 
cies. Their efforts have focused on dramatically reduc- 
ing the runtime of important build tools, such as the 
C/C++ linker, which is a bottleneck in large parallel
builds because it is used in the final step to combine all 
results. [1]

4 Build specification 

Systems like make lean heavily on build specification via 
an explicit dependency graph. This has certain advan- 
tages: dynamic scheduling of incremental, parallel builds 
is straightforward as outlined above, and it's also intu- 
itive to create build description files that include multiple 
targets and allow the developer to choose to build only a 
subset of them (and these targets may share dependen- 
cies). 

One of the simplest ways to specify a build is with a 
sequence of shell commands, a basic shell script. Any 
sequential build is equivalent to such a script. Both incremental
and parallel builds can be implemented in this
setting by inferring dependencies from previous builds
(see sections 7.3 and 7.4). This scheme can be extended to
include nonrecursive function calls and variables without 
adding significant complexity. It has the advantage of be- 
ing intuitive and familiar to procedural programmers, but 
unlike explicit dependency graphs becomes less intuitive 
when building a subset of targets. 

The most general type of build specification is the 
build program or build script. Such a script is written in a 
general-purpose language and may employ sophisticated 
abstraction mechanisms, algorithms, and data structures. 
Vesta's functional build language [7] and scons's Python
build descriptions [8] are examples. In some cases it may
even be integrated into the application being built, allow- 
ing the application to generate source code and rebuild 
itself or portions of itself. Incrementalism can be ex- 
tracted using memoization, as in Vesta, and parallelism 
can be extracted using futures. Although this is the most
flexible option, automatically extracting incrementalism and
parallelism from a general build program is challenging and
in some cases infeasible.
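The following sketch shows, under strong simplifying assumptions, how memoization and futures might combine in a build program. The compiler and linker are stand-in Python functions, the in-memory dictionary stands in for a persistent derived-file cache, and every build step is assumed to be a pure function of its arguments; none of these names come from Vesta or scons.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

_cache = {}  # stands in for a persistent derived-file cache
calls = []   # records which steps actually ran, for illustration only


def memoized_step(name, fn, *inputs):
    """Run fn(*inputs) unless this step already ran on identical inputs."""
    key = hashlib.sha256(repr((name, inputs)).encode()).hexdigest()
    if key not in _cache:
        calls.append(name)
        _cache[key] = fn(*inputs)
    return _cache[key]


def compile_unit(source):  # hypothetical stand-in for a compiler
    return "obj(" + source + ")"


def link(*objects):  # hypothetical stand-in for a linker
    return "exe[" + ",".join(objects) + "]"


def build(sources, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each compile is a future, so independent units may run concurrently.
        futures = [
            pool.submit(memoized_step, "cc", compile_unit, src) for src in sources
        ]
        objects = [f.result() for f in futures]
    return memoized_step("ld", link, *objects)
```

Calling `build` a second time with one source changed re-runs only that compile step and the link step. The difficulty noted above appears as soon as a step reads shared state that is not passed as an argument, since the cache key then no longer captures everything the result depends on.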

Some practical tools mix these approaches; make for 
example incorporates basic variables and conditionals 
while remaining primarily based on dependency graphs. 
Other hybrids may be possible, such as a Makefile-like
language where both dependency lists and actions can 
be program fragments in a general-purpose language. A 
major goal of future work is to design a build description 
language that can concisely represent typical builds in 
practice, minimizing opportunities for error, but remain 
flexible and scalable enough to accommodate large and 
complex builds. 

5 Capturing access to shared state 

Standard tools such as make rely on the developer to 
manually specify all shared state which is accessed by 
each task, making the system unable to distinguish a 
valid build from an invalid one. There are several tech- 
niques for reliably, automatically capturing access to 
shared state. 

5.1 File system filtering 

It is straightforward to implement a filesystem or net- 
work filesystem server which acts as a proxy, monitoring 
all file operations and mapping them onto an underlying 
filesystem. Some filesystem subsystems, as in Windows 
NT, have explicit support for filters to capture all file op- 
erations, for use by virus scanners and backup utilities. 
To detect conflicts, the system must know which build 
task is performing each file operation, usually inferred 
from the process ID. The technique extends easily to dis- 
tributed build systems. 



This approach was the primary means of capturing de- 
pendencies in Vesta, and is simple to deploy (although 
it typically requires superuser access). Its main disad- 
vantages are that it only captures operations on files and 
only at whole-file granularity, it must be applied to ev- 
ery filesystem a build process could possibly access, and 
that the file API is typically at an inappropriate level of 
abstraction, yielding too much information on each call. 

5.2 System call interception 

On typical MMU-based systems, all access to shared 
state by a process passes through system calls, which 
can be intercepted either through binary rewriting or 
through kernel support for system call interception such 
as ptrace. Unlike file system filtering, system call inter- 
ception can capture all access to shared state including 
all filesystems, the network, and kernel data structures 
(with some minor exceptions like RDTSC, which can be 
disabled). 

One obstacle with system call interception is that typ- 
ical build tools generate very high volumes of system 
calls, many of which are unimportant for dependency 
tracking. In experiments, handling all system calls with a 
central ptrace monitor process led to crippling overhead. 
Binary rewriting suffers from a different performance is- 
sue: load-time rewriting is too expensive for short-lived 
processes, necessitating on-disk caching of instrumented 
binaries. Kernel patches (for ptrace) or in-process filtering
(for binary rewriting) can reduce the number of system
calls handled, but are more difficult to implement and deploy
and less flexible than minimum information libraries.

A more fundamental obstacle with system call interception
is that applications routinely invoke system calls
that return more information than they require. For example,
UNIX applications testing for the existence of a
file routinely use the stat() system call, which also returns
the last modified and last accessed time of the file,
which change frequently. Another daunting case is environment
variables, which are passed to new processes
as a complete array; there is no way to determine which
ones are used through the system call interface. Similarly,
an application may read in a database file just to
use one row of a table, or (as was observed in some open-source
tools) cache the contents of a directory to accelerate
future queries. To ensure correctness, the build system
must assume all the information available to the process
is used by it, which leads to unacceptable performance.
Dynamic taint tracking, [5] used to track the flow
of untrusted data in security applications, could be used
to trace the flow of system call results in-process, but has
high overhead, and may fail to accurately track complex
cases, such as an array of environment variables being
transformed into a hash table data structure.



5.3 Minimum information libraries 

A minimum information library is a library designed to 
supply the minimum information that will be used by the 
caller and no other information, even in case of error. For 
example, whereas a POSIX application may use stat or 
fopen to determine if a file exists, a minimum informa- 
tion library would supply a fileExists method returning 
a boolean. It would only return true if the file exists, or 
false if it doesn't exist or is inaccessible. Similarly, envi- 
ronment variables would be accessed through get and set 
functions instead of by parsing the environment block. 
These expose fine-grained dependencies in the applica- 
tion while still making the same number of system calls 
under the covers. 
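A minimal sketch of such a library follows, instrumented to record each resource it reads. The class and method names are hypothetical, and a real library would cover far more of the system interface.

```python
import os


class MinInfoLib:
    """Hypothetical minimum information library that logs each resource read."""

    def __init__(self):
        self.reads = set()

    def file_exists(self, path):
        # Exposes one boolean; timestamps, sizes and error codes stay hidden.
        self.reads.add(("exists", path))
        return os.path.exists(path)

    def get_env(self, name):
        # Records a dependency on one variable rather than the whole environment.
        self.reads.add(("env", name))
        return os.environ.get(name)
```

After a task runs, `reads` contains exactly the existence predicates and environment variables it consulted. A task that called stat directly would instead have to be treated as depending on every field stat returns.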

Minimum information libraries have a natural corre- 
spondence to resources as defined in this work: every 
resource can have an associated call in the library that 
reads and (where applicable) writes that resource. Other 
calls may read or write multiple resources. 

A minimum information library can be easily instru- 
mented to acquire one or more resources with every call, 
or to acquire a single resource to serve many calls, avoid- 
ing a proliferation of acquisitions. It can either save this 
information for later analysis, or contact a central build 
manager process to acquire a lock on the resource. By 
eliminating or wrapping all library calls that invoke the 
kernel, all access to shared state can be directed through 
the minimum information library, ensuring that all de- 
pendencies are systematically tracked. 

When an application is written against a minimum in- 
formation library, dependency tracking is simplified, but 
for many build tools that are either binary-only or man- 
aged by third parties, porting to another runtime library 
is a poor investment. For cases like these, a promis- 
ing alternative is the build wrapper, a small tool using 
a minimum information library that replaces the tool and 
acquires any needed resources, then invokes the underly- 
ing tool normally. Such a wrapper often requires only a 
small subset of the functionality implemented by the full
build tool. 

Unlike the other solutions above, minimum informa- 
tion libraries require some work to be done for every 
build tool, including application-specific build tools, and 
bugs in this code can lead to build unreliability. How- 
ever, the number of build tools in a build is very small 
compared to the number of build tasks, typically ranging 
from 1 to 50. For widely-used tools like gcc, the work 
can be shared among many users of the tool and devel- 
oped to maturity, while application-specific build tools 
tend to be very simple, with dependencies inferrable 
from the command line alone. 



6 Change detection

The change detection problem is the problem of captur- 
ing changes to shared state between builds, for the pur- 
pose of implementing incremental builds. Traditional 
build systems like make rely on comparing timestamps 
between task input and output files to determine if a task 
needs to be re-run. This is overconservative, in that un- 
modified files may have updated timestamps; incorrect, 
in that tasks may not be run if timestamps travel back- 
wards (as when restoring from a backup); and inefficient, 
in that all tasks and all their input files must be examined 
even for a small incremental build. New build managers 
like scons [8] rely on hashes of file contents to detect
changes, fixing the first two problems at the expense of 
even more inefficiency. Moreover, both these approaches 
are ineffective for resources other than simple files. 
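The hash-based approach can be sketched as follows. The function names are hypothetical, and the sketch also exhibits the inefficiency noted above, since every tracked file is read in full on every snapshot.

```python
import hashlib


def snapshot(paths):
    """Map each path to the SHA-256 of its contents, or None if unreadable."""
    result = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                result[path] = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            result[path] = None
    return result


def changed_resources(old, new):
    """Return the paths whose recorded content hash differs between snapshots."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}
```

Unlike a timestamp comparison, touching a file without altering its contents reports no change, and restoring an older version of a file is still detected.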

Ideally, change detection should log exactly which re- 
sources in the chosen resource space are modified at the 
moment they are modified, making their retrieval trivial. 
This would be straightforward if all applications were 
written against the same minimum information library as 
the build tools, but this is infeasible in practice because 
development tools are generally third party and difficult 
to wrap due to being interactive and long-lived. 

Some kernels support keeping a log of all modified 
files, including NTFS's USN change journal iffOl and 
Linux with Stefan Bttcher's fschange patch. f2] Com- 
bined with a resource database that tracks old values 
of resources, these can be used to detect changes to 
filesystem-based resources as soon as they occur ZFS 
uses Merkle trees to efficiently track hashes of the 
contents of all files at all times, for integrity and de- 
duplication, but this information is not user-accessible 
without a patch. Network-based resources can be inter- 
cepted by packet sniffers, at some overhead. 



7 Specifying and inferring dependencies 
7.1 Manual dependency specification 

Although primitive, manual dependency specification of- 
fers a transparency and flexibility difficult to achieve 
with other methods. If coarse-grained tasks are used (see 
section[8]l, dependencies don't have to be updated too of- 
ten, easing the maintenance burden. In this scenario, the 
primary function of the build manager is to detect invalid 
builds, with error messages suggesting how to repair the 
build description file. It can also optionally warn about 
redundant dependencies. 



An extension of manual dependency specification is to 
have a build that proceeds in phases, where earlier phases 
generate dependencies used by later phases. A simple 
example of this is the typical integration of make with 
gcc -M, where dependency files are generated from 
source files in the first phase, and in the second phase 
source files are compiled using those dependencies. This 
can be extended to more phases in scenarios where tools 
must first be built to generate dependencies. Because 
each phase can be parallelized and incrementalized sep- 
arately, this approach can be similar in performance to 
the manual approach. Some degree of interleaving may 
be possible, but caution is required to ensure that no de- 
pendencies become available after the point where they 
are needed (or alternatively, rollback may be used in this 
case; see section 7.4 below).
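
To make the two-phase flow concrete, the sketch below shows how a second phase might consume the Makefile-style rules that a first phase produced with gcc -M. The function name `parse_depfile` and the sample rule are illustrative assumptions; the parser handles only plain `target: prerequisites` rules with backslash continuations, not escaped spaces or paths containing colons.

```python
def parse_depfile(text):
    """Parse Makefile-style rules ("target: dep dep ...") into a dict."""
    deps = {}
    # Join physical lines that end with a backslash continuation.
    joined = text.replace("\\\n", " ")
    for line in joined.splitlines():
        if ":" not in line:
            continue  # skip blank lines and anything that is not a rule
        target, _, rest = line.partition(":")
        deps.setdefault(target.strip(), []).extend(rest.split())
    return deps


# Phase 1 output (as gcc -M would print it for a hypothetical main.c).
sample = "main.o: main.c util.h \\\n config.h\n"

# Phase 2 input: the dependency edges for compiling main.o.
phase_two_deps = parse_depfile(sample)
```

Because the dependency file is itself a derived file, the first phase can be rerun incrementally whenever a source file changes.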

7.3 Offline dependency graph augmentation

An alternate strategy is to infer dependencies based on 
the conflicts observed in an invalid build. If two tasks 
conflict but there is no directed path between them, the 
system can add an edge between them, but needs more 
information to infer the direction of the edge. One sim- 
ple way to supply this information is to give a serial or- 
dering of all tasks — then if A and B conflict, whichever 
comes earlier in the serial order is run first. In the case of 
dynamically scheduled tools, such a serial order can be 
inferred after the fact from any deterministic walk over 
the task execution tree of the build. Once the graph is up- 
dated, the build is re-executed (invalidating the conflict- 
ing tasks to force them to re-execute), and this process 
is repeated until a valid build is observed. Termination 
is guaranteed because eventually the dependency graph 
will contain a path through all tasks, and so necessarily 
be valid. 
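
One round of this augmentation can be sketched as follows. This is a minimal illustration rather than the implementation of any prototype: the graph representation (a dict mapping each task to the set of tasks that must run after it), the function names, and the example tasks are all assumptions made for the sketch, and conflicts are taken as given.

```python
def has_path(graph, src, dst):
    """Depth-first search for a directed path from src to dst."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return False


def augment(graph, conflicts, serial_order):
    """Add an edge for each conflicting pair that lacks a directed path.

    Returns the edges added; an empty list means the observed build was
    valid with respect to the given conflicts.
    """
    rank = {task: i for i, task in enumerate(serial_order)}
    added = []
    for a, b in conflicts:
        if has_path(graph, a, b) or has_path(graph, b, a):
            continue  # already ordered by the existing graph
        # The task that comes earlier in the serial order runs first.
        first, second = sorted((a, b), key=rank.__getitem__)
        graph.setdefault(first, set()).add(second)
        added.append((first, second))
    return added


graph = {"gen": {"compile"}}
conflicts = [("link", "compile"), ("gen", "compile")]
added = augment(graph, conflicts, ["gen", "compile", "link"])
```

In a full system the build would be re-executed after each round, with the conflicting tasks invalidated, until `augment` returns no new edges.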

Inferred dependencies are stored as derived files that 
can be shared between developers (via a derived file 
cache, or simply through version control). For this rea- 
son, invalid builds are expected to occur infrequently, 
only when source files change in a way that adds depen- 
dencies. 

Because the serial ordering is used to direct dependen- 
cies, the parallel build that results from this algorithm 
will produce the same final result as a sequential build 
of the serial ordering (per the theorem of Appendix A).
Such a build is predictable, easy to test, and easy to
conceptualize for the developer. Compared to manual
dependency specification, dependency inference allows more
concise build description files that require less frequent
updating. However, unforeseen conflicts may lead to
excessive edges and build bottlenecks.

A challenging problem for this strategy is determining 
when to remove inferred dependencies. The build can 
easily detect when there is no conflict between two tasks, 
but it is difficult to establish whether the lack of conflict 
is a short-lived or long-lived phenomenon. For example, 
in a C++ project, there may be a certain header file which 
is only included in debug builds, resulting in dependen- 
cies that appear in debug builds but not in release builds. 
One simple strategy is to periodically erase all inferred 
dependencies and re-run the build to reproduce them. 

7.4 Transaction-based task synchronization

Another strategy is to prevent any invalid builds from 
occurring by inferring dependencies on-the-fly at run- 
time. Using concepts from database transactions, we 
lock resources before accessing them by submitting a 
lock request to the build manager process. If the re- 
source is already locked, the task is blocked until it is 
available. Tools with build wrappers can lock all neces- 
sary resources before invoking the real tool. However, 
once locks are in use, deadlock is possible, and to make
progress tasks must support abort and rollback, which 
kills the task and undoes its previous effects to the shared 
state. 
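
The blocking and deadlock behavior can be illustrated with a small lock table. The sketch below is a simplification under stated assumptions: locks are exclusive only, each task waits on at most one resource at a time, and the class `LockManager` and its return values are hypothetical names. Rollback of a task's effects is not modeled.

```python
class LockManager:
    """Exclusive locks with deadlock detection by following wait chains."""

    def __init__(self):
        self.owner = {}    # resource -> task currently holding the lock
        self.waiting = {}  # task -> resource it is blocked on

    def request(self, task, resource):
        """Return "granted", "blocked", or "deadlock"."""
        holder = self.owner.get(resource)
        if holder is None or holder == task:
            self.owner[resource] = task
            self.waiting.pop(task, None)
            return "granted"
        # Follow holder -> awaited resource -> its holder; reaching the
        # requesting task again means the wait would close a cycle.
        current, seen = holder, set()
        while current is not None and current not in seen:
            if current == task:
                return "deadlock"
            seen.add(current)
            current = self.owner.get(self.waiting.get(current))
        self.waiting[task] = resource
        return "blocked"

    def release_all(self, task):
        """Drop every lock held by a task that completed or was aborted."""
        self.waiting.pop(task, None)
        for res in [r for r, t in self.owner.items() if t == task]:
            del self.owner[res]


m = LockManager()
outcomes = [
    m.request("A", "x"),  # A takes x
    m.request("B", "y"),  # B takes y
    m.request("A", "y"),  # A must wait for B
    m.request("B", "x"),  # B would wait for A: a cycle
]
```

When a request reports a deadlock, some task in the cycle has to be aborted and its locks released before the others can proceed.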

By itself, this algorithm will yield an unpredictable or- 
dering of conflicting tasks, leading to nondeterminism in 
build outputs. Suppose we wish instead to produce the 
same final output as the sequential serial build. In this 
case, we can employ a version of multiversion timestamp 
concurrency control [1], placing each task inside a trans- 
action with a virtual timestamp equal to its order in the 
serial build. If a task observes a value that was written 
by a task with a later timestamp, this is termed physically 
unrealizable behavior, and forces an abort and rollback 
of the reader and any tasks influenced by its writes di- 
rectly or indirectly (ordinary multiversion timestamping 
rolls back the writer, but in our scenario this can lead to a 
failure to make progress). Unlike the pessimistic locking
strategy above, this is an optimistic strategy, and so avoids
blocking tasks at the cost of more frequent restarts.
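
The following sketch shows one possible reading of this rule, assuming that a list of versions is kept for each resource. The case it handles is the one in which ordinary multiversion timestamping would abort the writer: a task earlier in the serial order writes after a later task has already read an older version, so the reader is reported for rollback instead. The class `VersionedResource` and its methods are hypothetical, and cascading rollback of tasks influenced by the reader is not modeled.

```python
import bisect


class VersionedResource:
    """Versions of one resource, ordered by the writer's serial timestamp."""

    def __init__(self, initial):
        self.stamps = [0]       # timestamp 0 stands for the initial state
        self.values = [initial]
        self.reads = []         # (reader timestamp, version timestamp seen)

    def read(self, ts):
        """Return the newest version written at or before timestamp ts."""
        i = bisect.bisect_right(self.stamps, ts) - 1
        self.reads.append((ts, self.stamps[i]))
        return self.values[i]

    def write(self, ts, value):
        """Install a version; return timestamps of readers to roll back."""
        i = bisect.bisect_left(self.stamps, ts)
        if i < len(self.stamps) and self.stamps[i] == ts:
            self.values[i] = value
        else:
            self.stamps.insert(i, ts)
            self.values.insert(i, value)
        # A later reader that saw an older version missed this write.
        stale = sorted({r for r, seen in self.reads if seen < ts < r})
        self.reads = [(r, s) for r, s in self.reads if r not in stale]
        return stale


res = VersionedResource("v0")
first = res.read(3)               # task 3 runs ahead and sees the initial state
to_roll_back = res.write(2, "v2")  # task 2 then writes: task 3 read stale data
second = res.read(3)              # after restarting, task 3 sees task 2's write
```

Writes by tasks later in the serial order never disturb earlier readers, which is what lets tasks run without blocking.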

8 Task and resource granularity 

Fine-grained tasks allow incremental builds to avoid re- 
dundant work and parallel builds to run more tasks in 
parallel. Generally the most fine-grained task possible is 
an execution of a build process, since such tasks cannot 
be easily subdivided. However, the intuitive association 
of a single process with a task may be counterproduc- 
tive: a large number of processes leads to a large number 
of tasks and a large dependency graph which takes more
time to construct and analyze. By partitioning this graph
and collapsing each partition to a single task, the graph 
size can be dramatically reduced with only a modest in- 
crease in incremental build times. There is also little 
to no decrease in parallelism in practice, either because 
the reduced build is still capable of saturating the hard- 
ware's parallelism capacity, or because individual build 
tools support parallel execution. One typical strategy for 
accomplishing this is switching from a "file-based" com- 
pilation method to a "module-based" method, where en- 
tire directories are compiled into static/shared libraries 
or binaries in a single step. Some build tools, like the 
Microsoft Visual C# compiler, exclusively use this ap- 
proach. 
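
Collapsing a partitioned graph can be sketched in a few lines. The edge list, the partition mapping, and the task names below are illustrative assumptions; the sketch also assumes the partition is chosen so that the collapsed graph remains acyclic.

```python
def collapse(edges, partition):
    """Collapse a task graph; `partition` maps each task to its group."""
    coarse = set()
    for src, dst in edges:
        a, b = partition[src], partition[dst]
        if a != b:  # edges inside a group disappear
            coarse.add((a, b))
    return coarse


# File-based build: two compiles feed an archive step, which feeds a link.
edges = [("cc_a", "ar_x"), ("cc_b", "ar_x"), ("ar_x", "link_app")]

# Module-based build: the whole library becomes a single task.
partition = {"cc_a": "module_x", "cc_b": "module_x",
             "ar_x": "module_x", "link_app": "app"}

coarse_edges = collapse(edges, partition)
```

Here three edges among four tasks reduce to a single edge between two tasks, at the cost of rebuilding the whole module when any one of its files changes.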

Along with a decrease in graph size, the frequency 
of updates to the dependency graph is lowered, making 
manual graph maintenance more feasible and leading to 
a smaller number of rebuilds. 

Similarly, the intuitive fine-grained association of re- 
sources with individual files can be counterproductive. 
For example, every task has a set of "owned" resources 
that only that task depends on, which can be collapsed 
into a single resource without increasing build times. If 
the tasks are coarse-grained, this can substantially reduce 
graph size. Another important case is the set of sys- 
tem resources, such as build tool executables, that are 
rarely updated and used by nearly all tasks. By collaps- 
ing rarely-updated, widely-used resources into a single 
resource, an enormous number of dependency edges are 
eliminated, and long incremental builds are only needed 
during a system update — at which time a full rebuild is 
needed anyway. 

Decreasing graph size decreases overhead differently 
depending on the system used. In a lock-based system 
with a central build manager, it results in fewer lock and
unlock operations and less interprocess communication.
In a system that logs dependencies, it leads to fewer and 
smaller log files and less time loading them. In a sys- 
tem that performs static DAG scheduling, the scheduling 
algorithm runtime is reduced and an improved schedule 
may become feasible. These optimizations are essential 
to ensure that build overhead does not dominate build 
time. 

In order to achieve these gains, the partitioning must 
be known and available to all tasks before the build be- 
gins. Both task and resource partitioning can be inferred 
by analysis of the dependency graphs of previous builds. 
Task partitioning can also be specified implicitly by de- 
scribing each task using a command sequence or script 
that performs all necessary actions for that task. Re- 
source partitioning can be specified manually, e.g. by us- 
ing directory patterns to distinguish application and sys- 
tem resources. A promising hybrid approach that both 
limits incremental build time and keeps graph size small
is to automatically use smaller partitions for resources
that are modified frequently (e.g. the module the devel- 
oper is currently working on) and increasingly larger par- 
titions for resources that are modified less frequently. 

9 Preliminary experimental results 

Three prototype build systems were constructed. 

In the first, a ptrace-based prototype that could only 
perform full builds, a pessimistic locking scheme was 
used where build processes took locks on any files they 
accessed. Processes also took "predicted locks" on any 
files they accessed during previous builds; predicted 
locks cause processes with later timestamps (which oc- 
cur later in the serial build) to block if they attempt 
to lock the file. This allows cascading rollback to be 
avoided. Given enough concurrent processes, this build 
scaled to 85% of the time of a parallel make build of the
Linux kernel. However, it was not a complete system, 
as it was unable to handle unexpected new dependencies, 
could not perform incremental builds, and inferred its list 
of processes to execute from a prior make run, making it 
necessary to rerun the make build whenever this process 
sequence was changed. 

The major performance bottleneck in this prototype 
was the necessity for the central build monitor process 
to sequentially handle all ptrace messages. A variant of 
this prototype used binary rewriting based on Jockey [12]
to track system calls without the use of ptrace. Jockey 
rewrites binaries at load time by searching for system 
calls, and also keeps a cache of patches to apply for bina- 
ries it's seen before. In practice, even with caching, the 
system added too much overhead to be practical due to 
the Linux kernel build's enormous number of short-lived 
processes like cp and mkdir. This is less likely to be an 
issue in a more monolithic build system. 

The second prototype was based on multiversion 
timestamping and was able to handle process hierarchies. 
Instead of replacing make, make is run sequentially and 
children of make are run speculatively, pretending to 
succeed so that make will continue and begin the next 
process. Rollback was implemented by performing all 
writes in a temporary location and then committing them 
after a process completes, which can be accomplished 
by rewriting results of system calls (a simple form of 
filesystem virtualization). Although the system was able 
to run real-world builds, and was powerful enough to 
complete builds even given no initial dependency infor- 
mation at all, the overhead of its transaction management 
and filesystem virtualization prevented it from scaling to 
larger builds, particularly since the build manager ran 
sequentially. On very small builds with few dependen- 
cies, it could outperform a sequential make build by 30% 
while offering the same results and reliability, but even
on medium-sized builds this performance advantage was
lost. In neither case could it compete with parallel make 
builds. 

The system also supports reliable incremental builds: 
it keeps a cache similar to Vesta's runtool cache, and 
whenever a process is re-executed with the same inputs 
as in a prior run, it skips running the process and com- 
mits its cached results. Although its incremental builds 
are much faster than its full builds, they are not compet- 
itive with incremental builds by make, for several rea- 
sons: the main make process is still run as it would be in 
a full build, input files have to be hashed to implement 
the cache reliably, and the filesystem virtualization (par- 
ticularly committing cached results) is expensive. 
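
The idea of skipping a process whose inputs are unchanged can be sketched as a cache keyed by a hash of the command and its input contents. This is a minimal illustration, not the prototype's code or Vesta's cache: the names `cache_key`, `ResultCache`, and `fake_compile` are hypothetical, inputs are passed in as an in-memory dict, and the commit of cached results to the filesystem is not modeled.

```python
import hashlib


def cache_key(command, input_contents):
    """Hash the command line and the contents of every input resource."""
    h = hashlib.sha256()
    h.update(command.encode())
    for name in sorted(input_contents):
        h.update(b"\0" + name.encode() + b"\0")
        h.update(hashlib.sha256(input_contents[name]).digest())
    return h.hexdigest()


class ResultCache:
    def __init__(self):
        self.entries = {}
        self.executions = 0  # how many times a process really ran

    def run(self, command, input_contents, execute):
        """Return cached results if present, else execute and record them."""
        key = cache_key(command, input_contents)
        if key not in self.entries:
            self.executions += 1
            self.entries[key] = execute(command, input_contents)
        return self.entries[key]


def fake_compile(command, inputs):
    # Stand-in for a real build tool: upper-cases its concatenated inputs.
    return b"".join(inputs[n] for n in sorted(inputs)).upper()


cache = ResultCache()
a = cache.run("cc main.c", {"main.c": b"int x;"}, fake_compile)
b = cache.run("cc main.c", {"main.c": b"int x;"}, fake_compile)  # cache hit
c = cache.run("cc main.c", {"main.c": b"int y;"}, fake_compile)  # input changed
```

The hashing step in `cache_key` is the cost noted above: every input must be read in full to decide whether the process can be skipped.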

The third prototype abandoned transactions and sys- 
tem call tracing in favor of cooperation with build tools. 
A variety of open-source build tools were instrumented 
to declare their dependencies at runtime using a C li- 
brary called deptracker, which then wrote them out to 
an XML file when the process exited. An offline analy- 
sis step would then load all of these, detect conflicts, and 
(together with sequential build order information logged 
by an instrumented make tool) generate a supplemen- 
tary Makefile to augment the existing dependency graph. 
Initial performance evaluation with small builds showed 
that the time needed to load and process the XML files 
was a substantial portion of build time, as much as 30% 
of the build, suggesting that a coarser granularity of tasks 
and/or resources is needed to accelerate this stage. 

Another challenge for this prototype was the impracti- 
cality of maintaining a forked and instrumented codebase 
for every build tool used by a build, including many like 
gcc with much larger builds themselves than the build un- 
der evaluation — effective build wrappers could mitigate 
this problem. 

10 Conclusion and future work 

This work discussed design options for constructing a 
reliable build system and highlighted tradeoffs between 
them, but many of the ideas remain untested. A clear next 
step is building a complete build manager that can handle 
a real-world large build, including change detection and 
dependency inference, and measuring its overhead compared
to existing solutions. Developing a meaningful perfor- 
mance testing method for incremental builds is another 
challenge. Expanding the model and giving design op- 
tions to support distributed builds would be valuable. 

Incorporating features of Vesta, such as a shared de- 
rived file cache and repeatable builds via integration with 
existing version control systems, would be another
intriguing direction. Taking this to extremes, it may be
valuable to have a "cloud cache" that shares derived
files for building open-source projects among developers
throughout the world. 

Some of the concepts that are useful for reliable build 
systems can also be applied in other domains. For ex- 
ample, because minimum information libraries allow re- 
source dependencies of code segments to be reliably and 
precisely identified, they can be used to compute infor- 
mation transfer from one portion of a program to another 
through shared state, which is often overlooked by dy- 
namic analysis tools. 

Finally, there is a great deal of practical work needed 
to get a functional reliable build system into the hands 
of everyday users, including supporting major tools and 
environments, providing an expressive build description 
language, and pushing for better change detection sup- 
port in mainstream kernels. 

A Well-definedness of a valid configuration

A configuration specifies the dependency graph and ini- 
tial shared state for a build. Recall that a valid build is 
one where, for any pair of conflicting tasks, there is a 
directed path from one to the other in the dependency 
graph. We begin by showing a lemma: 

Lemma A.1. If a given build is valid, any other valid
build with the same configuration produces the same final
result.

Proof. Define the canonical access sequence as the se- 
quence obtained by fixing some topological order and 
executing each task sequentially in that order. Given a
valid build's access sequence, we will perform a series 
of swaps to transform it into the canonical sequence. 

Suppose two tasks are interleaved (neither performs all 
its accesses before those of the other). Then there is not a 
directed path between them in the dependency graph, and
since the build is valid, they must not conflict. Hence we
can safely swap accesses to ensure that the two tasks are
no longer interleaved. By doing this for all pairs of tasks,
we get a sequential schedule which performs all of each
task in some order $(t_1, t_2, \ldots, t_n)$ which is a topological
sort of the dependency graph.

Any two tasks in a topological sort can be swapped 
unless there is an edge between them, and the result is 
still a topological sort. Such swaps can be used to trans- 
form the sequence into any other topological sort while 
preserving the final output, including the canonical se- 
quence. Hence any valid build's access sequence pro- 
duces the same final result: the result produced by the 
canonical sequence. □

We now generalize this to the stronger result:

Theorem A.1. If a given build is valid, all builds with
the same configuration are valid and produce the same 
final result. If a given build is invalid, all builds with the 
same configuration are invalid. 



Proof. By Lemma A.1, if all builds are valid, they all
produce the same final result. It remains to show a single 
configuration cannot generate both a valid and an invalid 
build. 

Suppose we have an invalid build $(a_1, a_2, \ldots, a_n)$ and
a valid build $(b_1, b_2, \ldots, b_n)$, both with a given access
sequence. We will gradually transform the first into the
second.

We find the first point at which they diverge, $a_i \neq b_i$,
locate $a_j$ such that $a_j = b_i$, and move it up to the $i$th
position by a series of swaps. If $a_j$ did not conflict with any
of $a_{i+1}, \ldots, a_{j-1}$, then the behavior of all tasks is
preserved: the new access sequence is a feasible build, and
is valid if and only if the previous sequence was valid.

Suppose on the other hand $a_j$ does conflict with at
least one of $a_{i+1}, \ldots, a_{j-1}$; let the first be $a_m$.
Because swapping $a_j, a_m$ may change task behavior, the
build must be conceptually re-executed starting after $a_m$
to get a feasible new access sequence. In the previous
iteration, $a_j$ followed $a_m$, whereas in the current iteration
$a_j$ precedes $a_m$; this implies the two tasks owning these
accesses have no directed path between them in the
dependency graph. But $a_j, a_m$ conflict, so the new build is
invalid.

In either case, the common prefix of the two builds 
grows by at least one access with each iteration, and 
eventually the build $(b_k)$ is reached. However, in both
cases the invalidity of the original build is preserved, so
$(b_k)$ is invalid as well. This is a contradiction, so there
cannot be both an invalid and a valid build. □

This means the definition of a valid configuration as 
a pair (dependency graph, start state) producing valid 
builds is well-defined. 
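
The validity condition recalled at the start of this appendix can be checked mechanically once each task's accesses are known. The sketch below is illustrative only: it assumes the usual read/write notion of conflict (two tasks conflict when one writes a resource the other reads or writes), which may be coarser or finer than the resource model used in the body of the paper, and all names and example tasks are hypothetical.

```python
from itertools import combinations


def conflicts(a, b):
    """Assumed conflict rule: one task writes what the other reads or writes."""
    (reads_a, writes_a), (reads_b, writes_b) = a, b
    return bool(writes_a & (reads_b | writes_b) or writes_b & reads_a)


def reachable(graph, src):
    """Set of tasks reachable from src by one or more dependency edges."""
    seen, stack = set(), [src]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def invalid_pairs(graph, accesses):
    """Conflicting task pairs with no directed path in either direction."""
    reach = {t: reachable(graph, t) for t in accesses}
    return [(s, t) for s, t in combinations(sorted(accesses), 2)
            if conflicts(accesses[s], accesses[t])
            and t not in reach[s] and s not in reach[t]]


# task -> (resources read, resources written)
accesses = {
    "gen": (set(), {"tbl.h"}),
    "cc": ({"main.c", "tbl.h"}, {"main.o"}),
    "doc": ({"main.c"}, {"main.html"}),
}
missing = invalid_pairs({}, accesses)
fixed = invalid_pairs({"gen": {"cc"}}, accesses)
```

A build is valid under this check exactly when `invalid_pairs` returns an empty list.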